952 research outputs found

    An investigation of the performance portability of OpenCL

    This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5× slower than native FORTRAN 77 or CUDA implementations on a single node and 1.3–3.1× slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3× speed-up over our original OpenCL implementation.
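The memory-arrangement trade-off this abstract alludes to (interleaved accesses suiting GPUs versus contiguous per-thread chunks suiting CPUs) can be sketched abstractly. The following is an illustrative model only, not code from the paper; the function names and mappings are hypothetical:

```python
def gpu_friendly_indices(work_item, num_items, n):
    # Interleaved mapping: consecutive work-items touch consecutive
    # array elements, so a GPU can coalesce the memory accesses.
    return list(range(work_item, n, num_items))

def cpu_friendly_indices(work_item, num_items, n):
    # Blocked mapping: each work-item owns one contiguous chunk,
    # which suits CPU caches and hardware prefetchers.
    chunk = (n + num_items - 1) // num_items
    start = work_item * chunk
    return list(range(start, min(start + chunk, n)))
```

A single-source OpenCL code targeting both device types must typically parameterise this choice rather than hard-code either layout.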

    Predictive analysis of a hydrodynamics application on large-scale CMP clusters

    We present the development of a predictive performance model for the high-performance computing code Hydra, a hydrodynamics benchmark developed and maintained by the United Kingdom Atomic Weapons Establishment (AWE). The developed model elucidates the parallel computation of Hydra, with which it is possible to predict its runtime and scaling performance on varying large-scale chip multiprocessor (CMP) clusters. A key feature of the model is its granularity; with the model we are able to separate the contributing costs, including computation, point-to-point communications, collectives, message buffering and message synchronisation. The predictions are validated on two contrasting large-scale HPC systems, an AMD Opteron/InfiniBand cluster and an IBM BlueGene/P, both of which are located at the Lawrence Livermore National Laboratory (LLNL) in the US. We validate the model on up to 2,048 cores, where it achieves greater than 85% accuracy in weak-scaling studies. We also demonstrate use of the model in exposing the increasing costs of collectives for this application, and the influence of node density on network accesses, therefore highlighting the impact of machine choice when running this hydrodynamics application at scale.
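A model of the granularity described above sums separately predicted cost terms. A minimal toy sketch of that structure (the terms, constants and the logarithmic collective model are illustrative assumptions, not Hydra's actual model):

```python
import math

def predict_runtime(t_compute, t_p2p, t_collective_base, n_cores):
    """Toy decomposed performance model: total runtime is the sum of
    independently modelled cost terms. Collectives are assumed here to
    scale logarithmically with core count, as in a tree reduction."""
    t_collective = t_collective_base * math.log2(max(n_cores, 2))
    return t_compute + t_p2p + t_collective
```

Because each term is separate, the model can report which cost dominates as the core count grows, which is how the increasing collective cost is exposed.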

    Resident block-structured adaptive mesh refinement on thousands of graphics processing units

    Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8-node cluster, and 4,196 nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.
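The coarsen/refine operators mentioned above transfer data between refinement levels. A minimal serial sketch of the two operations, assuming the simplest possible choices (prolongation by injection, restriction by averaging; the real library's data-parallel GPU operators are more sophisticated):

```python
def refine(patch, ratio=2):
    # Prolongation by injection: each coarse cell value is copied
    # into a ratio-by-ratio block of fine cells.
    return [[patch[i // ratio][j // ratio]
             for j in range(len(patch[0]) * ratio)]
            for i in range(len(patch) * ratio)]

def coarsen(patch, ratio=2):
    # Restriction by averaging: each ratio-by-ratio block of fine
    # cells collapses to one coarse cell holding the block mean.
    nr, nc = len(patch) // ratio, len(patch[0]) // ratio
    return [[sum(patch[i * ratio + di][j * ratio + dj]
                 for di in range(ratio) for dj in range(ratio)) / ratio**2
             for j in range(nc)]
            for i in range(nr)]
```

Averaging restriction exactly inverts injection prolongation, a useful sanity check when validating such operators.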

    Optimisation of patch distribution strategies for AMR applications

    As core counts increase in the world's most powerful supercomputers, applications are becoming limited not only by computational power, but also by data availability. In the race to exascale, efficient and effective communication policies are key to achieving optimal application performance. Applications using adaptive mesh refinement (AMR) trade off communication for computational load balancing, to enable the focused computation of specific areas of interest. This class of application is particularly susceptible to the communication performance of the underlying architecture, and is inherently difficult to scale efficiently. In this paper we present a study of the effect of patch distribution strategies on the scalability of an AMR code. We demonstrate the significance of patch placement on communication overheads, and by balancing the computation and communication costs of patches, we develop a scheme to optimise performance of a specific, industry-strength, benchmark application.
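One common baseline for distributing weighted patches across ranks is greedy longest-processing-time assignment. This sketch is a generic illustration of that baseline, not the paper's scheme (which also balances communication cost):

```python
import heapq

def distribute_patches(patch_costs, n_ranks):
    """Greedy LPT assignment: sort patches by descending cost and
    repeatedly give the next patch to the least-loaded rank."""
    loads = [(0.0, r) for r in range(n_ranks)]
    heapq.heapify(loads)
    assignment = {}
    for pid, cost in sorted(enumerate(patch_costs), key=lambda x: -x[1]):
        load, rank = heapq.heappop(loads)
        assignment[pid] = rank
        heapq.heappush(loads, (load + cost, rank))
    return assignment
```

A communication-aware scheme would additionally penalise placements that separate neighbouring patches across ranks.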

    MINIO: an I/O benchmark for investigating high level parallel libraries

    Input/output (I/O) operations are one of the biggest challenges facing scientific computing as it transitions to exascale. The traditional software stack – comprising parallel file systems, middlewares and high-level libraries – has evolved to enable applications to better cope with the demands of enormous datasets. This software stack makes high performance parallel I/O easily accessible to application engineers, however it is important to ensure best performance is not compromised through attempts to enrich these libraries. We present MINIO, a benchmark for the investigation of I/O behaviour focusing on understanding overheads and inefficiencies in high-level library usage. MINIO uses HDF5 and TyphonIO to explore I/O at scale using different application behavioural patterns. A case study is performed using MINIO to identify performance limiting characteristics present in the TyphonIO library as an example of performance discrepancies in the I/O stack.
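The core pattern of such an I/O benchmark is timing a controlled write workload and reporting bandwidth. A heavily simplified sketch using plain file I/O (MINIO itself drives HDF5 and TyphonIO, which this deliberately omits):

```python
import os
import tempfile
import time

def measure_write_bandwidth(n_bytes, block_size=1 << 20):
    """Write at least n_bytes in fixed-size blocks to a temporary
    file, fsync to include the flush cost, and return MB/s."""
    block = b"\0" * block_size
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    with os.fdopen(fd, "wb") as f:
        written = 0
        while written < n_bytes:
            f.write(block)
            written += block_size
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return written / (1 << 20) / elapsed
```

Comparing such a raw baseline against the same workload routed through a high-level library is one way to expose the library's overhead.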

    Performance optimisation of inertial confinement fusion codes using mini-applications

    Despite the recent successes of nuclear energy researchers, the scientific community remains some distance from being able to create controlled, self-sustaining fusion reactions. Inertial Confinement Fusion (ICF) techniques represent one possible option to surpass this barrier, with scientific simulation playing a leading role in guiding and supporting their development. The simulation of such techniques allows for safe and efficient investigation of laser design and pulse shaping, as well as providing insight into the reaction as a whole. The research presented here focuses on the simulation code EPOCH, a fully relativistic particle-in-cell plasma physics code concerned with faithfully recreating laser-plasma interactions at scale. A significant challenge in developing large codes like EPOCH is maintaining effective scientific delivery on successive generations of high-performance computing architecture. To support this process, we adopt the use of mini-applications – small code proxies that encapsulate important computational properties of their larger parent counterparts. Through the development of a mini-application for EPOCH (called miniEPOCH), we investigate a variety of the performance features exhibited in EPOCH, expose opportunities for optimisation and increased scientific capability, and offer our conclusions to guide future changes to similar ICF codes.

    Enabling portable I/O analysis of commercially sensitive HPC applications through workload replication

    Benchmarking and analyzing I/O performance across high performance computing (HPC) platforms is necessary to identify performance bottlenecks and guide effective use of new and existing storage systems. Doing this with large production applications, which can often be commercially sensitive and lack portability, is not a straightforward task and the availability of a representative proxy for I/O workloads can help to provide a solution. We use Darshan I/O characterization and the MACSio proxy application to replicate five production workloads, showing how these can be used effectively to investigate I/O performance when migrating between HPC systems ranging from small local clusters to leadership scale machines. Preliminary results indicate that it is possible to generate datasets that match the target application with a good degree of accuracy. This enables a predictive performance analysis study of a representative workload to be conducted on five different systems. The results of this analysis are used to identify how workloads exhibit different I/O footprints on a file system and what effect file system configuration can have on performance
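Workload replication of this kind reduces to replaying a characterised sequence of I/O operations against a scratch file. A toy sketch of that idea (the trace format here is invented for illustration; Darshan's actual records and MACSio's drivers are far richer):

```python
import os
import tempfile

def replay_io_trace(trace):
    """Replay a simplified I/O trace of (operation, size) records,
    as a Darshan-style characterisation might summarise them.
    Returns total bytes written and read."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    written = read = 0
    with open(path, "r+b") as f:
        for op, size in trace:
            if op == "write":
                f.write(b"\0" * size)
                written += size
            elif op == "read":
                f.seek(0)
                read += len(f.read(size))
    os.remove(path)
    return written, read
```

Because the replayed workload carries no proprietary data, it can be moved freely between systems, which is the portability property the paper exploits.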

    Mini-app driven optimisation of inertial confinement fusion codes

    In September 2013, the large laser-based inertial confinement fusion device housed in the National Ignition Facility at Lawrence Livermore National Laboratory, was widely acclaimed to have achieved a milestone in controlled fusion – successfully initiating a reaction that resulted in the release of more energy than the fuel absorbed. Despite this success, we remain some distance from being able to create controlled, self-sustaining fusion reactions. Inertial Confinement Fusion (ICF) represents one leading design for the generation of energy by nuclear fusion. Since the 1950s, ICF has been supported by computing simulations, providing the mathematical foundations for pulse shaping, lasers, and material shells needed to ensure effective and efficient implosion. The research presented here focuses on one such simulation code, EPOCH, a fully relativistic particle-in-cell plasma physics code, developed by a leading network of over 30 UK researchers. A significant challenge in developing large codes like EPOCH is maintaining effective scientific delivery on successive generations of high-performance computing architecture. To support this process, we adopt the use of mini-applications – small code proxies that encapsulate important computational properties of their larger parent counterparts. Through the development of a mini-application for EPOCH (called miniEPOCH), we investigate known timestep scaling issues within EPOCH and explore possible optimisations: (i) employing loop fission to increase levels of vectorisation; (ii) enforcing particle ordering to allow the exploitation of domain-specific knowledge; and (iii) changing underlying data storage to improve memory locality. When applied to EPOCH, these improvements represent a 2.02× speed-up in the core algorithm and a 1.55× speed-up to the overall application runtime, when executed on EPCC’s Cray XC30 ARCHER platform.
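Loop fission, optimisation (i) above, splits one loop that touches many arrays into several loops that each touch fewer, which can make the individual loops easier to vectorise. A minimal sketch of the transformation on a generic particle push (illustrative only; EPOCH's actual particle loop is far more involved):

```python
def push_particles_fused(xs, vs, forces, dt):
    # Original fused loop: velocity and position updates interleaved,
    # so each iteration touches three arrays.
    for i in range(len(xs)):
        vs[i] += forces[i] * dt
        xs[i] += vs[i] * dt

def push_particles_fissioned(xs, vs, forces, dt):
    # After loop fission: two simple loops, each touching two arrays.
    # The result is identical because the position update only needs
    # the already-updated velocity of the same particle.
    for i in range(len(vs)):
        vs[i] += forces[i] * dt
    for i in range(len(xs)):
        xs[i] += vs[i] * dt
```

Optimisation (iii), a structure-of-arrays layout, is already reflected here in the use of separate `xs`/`vs`/`forces` arrays rather than an array of particle records.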

    A randomised feasibility study of serial magnetic resonance imaging to reduce treatment times in Charcot neuroarthropathy in people with diabetes (CADOM): A protocol

    Background: Charcot neuroarthropathy is a complication of peripheral neuropathy associated with diabetes which most frequently affects the lower limb. It can cause fractures and dislocations within the foot, which may progress to deformity and ulceration. Recommended treatment is immobilisation and offloading, with a below knee non-removable cast or boot. Duration of treatment varies from six months to more than one year. Small observational studies suggest that repeated assessment with Magnetic Resonance Imaging improves decision making about when to stop treatment, but this has not been tested in clinical trials. This study aims to explore the feasibility of using serial Magnetic Resonance Imaging without contrast in the monitoring of Charcot neuroarthropathy to reduce duration of immobilisation of the foot. A nested qualitative study aims to explore participants’ lived experience of Charcot neuroarthropathy and of taking part in the feasibility study.
    Methods: We will undertake a two-arm, open study, and randomise 60 people with a suspected or confirmed diagnosis of Charcot neuroarthropathy from five NHS, secondary care multidisciplinary Diabetic Foot Clinics across England. Participants will be randomised 1:1 to receive Magnetic Resonance Imaging at baseline and remission up to 12 months, with repeated foot temperature measurements and x-rays (standard care plus), or standard care plus with additional three-monthly Magnetic Resonance Imaging until remission up to 12 months (intervention). Time to confirmed remission of Charcot neuroarthropathy with off-loading treatment (days) and its variance will be used to inform sample size in a full-scale trial. We will look for opportunities to improve the protocols for monitoring techniques and the clinical, patient centred, and health economic measures used in a future study. For the nested qualitative study, we will invite a purposive sample of 10–14 people able to offer maximally varying experiences from the feasibility study to take part in semi-structured interviews, to be analysed using thematic analysis.
    Discussion: The study will inform the decision whether to proceed to a full-scale trial. It will also allow deeper understanding of the lived experience of Charcot neuroarthropathy, and of factors that contribute to engagement in management, and contribute to the development of more effective patient centred strategies.
    Trial registration: ISRCTN74101606. Registered on 6 November 2017, http://www.isrctn.com/ISRCTN74101606

    Parallel block structured adaptive mesh refinement on graphics processing units

    Block-structured adaptive mesh refinement is a technique that can be used when solving partial differential equations to reduce the number of zones necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a native GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an eight-node cluster, and over four thousand nodes of Oak Ridge National Laboratory’s Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and has been scaled to over four thousand GPUs using a combination of MPI and CUDA.